[[This data comes from a real study on implicit and self-reported evaluations. The implementation of the procedure produced three data files: one for the demographics data, one for the self-reported evaluations, and one for the implicit measure (the ‘Affect Misattribution Procedure’). This script uses each of these to learn and practice functions from the readr, dplyr, and tidyr libraries that are commonly used for data wrangling. In doing so, we will learn how to do many of the steps involved in data processing for a given experiment.]]
```r
library(dplyr)
library(tidyr)
library(readr)
library(janitor) # for clean_names() and round_half_up()
library(roundwork) # for round_up()
library(stringr)
library(knitr) # for kable()
library(kableExtra) # for kable_classic()

# demographics data
data_demographics_raw <- read_csv(file = "../data/raw/data_demographics_raw.csv")

# self report measure data
data_selfreport_raw <- read_csv(file = "../data/raw/data_selfreport_raw.csv")

# Affect Misattribution Procedure data
data_amp_raw <- read_csv(file = "../data/raw/data_amp_raw.csv")

# clean column names
data_demographics_clean_names <- data_demographics_raw %>%
  clean_names()

data_selfreport_clean_names <- data_selfreport_raw %>%
  clean_names()

data_amp_clean_names <- data_amp_raw %>%
  clean_names()
```
6.2 Renaming columns
Often variable names are not intuitive. An early step in any data wrangling is to make them more intuitive.
Combine the above function calls using pipes. Notice how this involves fewer objects in your environment, and therefore less potential for confusion or error.
Remember: this is how we solve coding problems: break them down into smaller tasks and problems, get each of them working individually, then combine them together again. When you only see the end product, it’s easy to think the author simply wrote the code as you see it, when they often wrote much more verbose chunks of code and then combined them together.
Rewrite the rename and select calls for the AMP and self report data too.
```r
# remove all objects in environment
rm(list = ls())

data_demographics_trimmed <-
  # read in the data
  read_csv("../data/raw/data_demographics_raw.csv") %>%
  # convert to snake case
  clean_names() %>%
  # make names more intuitive
  rename(unique_id = subject,
         item = trialcode) %>%
  # retain only columns of interest
  select(unique_id, item, response)

data_selfreport_trimmed <-
  read_csv("../data/raw/data_selfreport_raw.csv") %>%
  clean_names() %>%
  rename(unique_id = subject,
         item = trialcode) %>%
  select(unique_id, item, response)

data_amp_trimmed <-
  read_csv("../data/raw/data_amp_raw.csv") %>%
  clean_names() %>%
  rename(unique_id = subject,
         block_type = blockcode,
         trial_type = trialcode,
         trial_id = blocknum_and_trialnum,
         rt_ms = latency) %>%
  select(unique_id,
         # methods variables
         block_type,
         trial_type,
         trial_id,
         # responses
         rt_ms,
         correct)
```
6.5 Counting frequencies
After renaming and selecting columns, we know what columns we have. But what rows do we have in each of these? What might we need to exclude, change, or work with in some way later on? It is very useful to use count() to obtain the frequency of each unique value of a given column.

```r
data_demographics_trimmed %>%
  count(item)
```
# A tibble: 2 × 2
item n
<chr> <int>
1 age 100
2 gender 100
Note that it is also possible to use count to obtain the frequencies of sets of unique values across columns, e.g., unique combinations of item and response.
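```r
# frequencies of a single column
data_demographics_trimmed %>%
  count(item)

# frequencies of unique combinations of two columns
data_demographics_trimmed %>%
  count(item, response)
```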
# A tibble: 2 × 2
item n
<chr> <int>
1 age 100
2 gender 100
# A tibble: 51 × 3
item response n
<chr> <chr> <int>
1 age 18 1
2 age 19 4
3 age 20 1
4 age 21 6
5 age 22 2
6 age 23 5
7 age 24 1
8 age 25 3
9 age 26 4
10 age 27 5
# ℹ 41 more rows
It can be useful to arrange the output by the frequencies.
```r
data_demographics_trimmed %>%
  count(item, response) %>%
  arrange(desc(n)) # arrange in descending order
```
# A tibble: 51 × 3
item response n
<chr> <chr> <int>
1 gender Male 36
2 gender female 27
3 gender male 18
4 gender Female 11
5 age 21 6
6 age 23 5
7 age 27 5
8 age 32 5
9 age 19 4
10 age 26 4
# ℹ 41 more rows
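count() also has a sort argument that orders the output by frequency without a separate arrange() call. The following is a minimal sketch on a small made-up data frame; toy_demographics and its values are invented for illustration and are not the study data.

```r
library(dplyr)

# hypothetical toy data, standing in for data_demographics_trimmed
toy_demographics <- tibble(
  item     = c("gender", "gender", "gender", "age", "age"),
  response = c("male", "male", "female", "21", "21")
)

# count combinations, then arrange by frequency
counts_arranged <- toy_demographics %>%
  count(item, response) %>%
  arrange(desc(n))

# the same ordering using count()'s sort argument
counts_sorted <- toy_demographics %>%
  count(item, response, sort = TRUE)

counts_arranged
counts_sorted
```

Either form is fine; the explicit arrange() makes the ordering step more visible when reading a longer pipeline.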
6.6 Filtering rows
Once we know the contents of our columns, we may wish to exclude some rows using filter().
You can specify the logical test for filtering in many ways, including equivalence (==), negation (!=), or membership (%in%). It is often better to define what you do want (using equivalence or membership) rather than what you do not want (negation), as negations are less robust to new data with weird values you didn’t think of when you wrote the code. E.g., you could specify gender != "non-binary" but this would not catch non binary. If you were for example looking to include only men and women, instead use gender %in% c("man", "woman").*
*[This is just an example; there is usually no good a priori reason to exclude gender diverse participants]
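To make the robustness argument concrete, here is a small sketch using invented data (toy_gender and its spellings are hypothetical). Negation only removes the exact string you anticipated, whereas membership only retains the values you explicitly listed.

```r
library(dplyr)

# hypothetical toy data with inconsistent spellings
toy_gender <- tibble(
  unique_id = 1:5,
  gender    = c("man", "woman", "non-binary", "non binary", "wman")
)

# negation: drops only the exact string "non-binary";
# "non binary" and the typo "wman" both slip through
kept_by_negation <- toy_gender %>%
  filter(gender != "non-binary")

# membership: retains only the values we explicitly asked for
kept_by_membership <- toy_gender %>%
  filter(gender %in% c("man", "woman"))

kept_by_negation$gender
kept_by_membership$gender
```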
```r
# example using equivalence
example_equivalence <- data_amp_trimmed %>%
  filter(block_type == "test")

# example using negation
example_negation <- data_selfreport_trimmed %>%
  filter(item != "instructions")

# example using membership
example_membership <- data_selfreport_trimmed %>%
  filter(item %in% c("positive", "prefer", "like"))
```
6.6.1 Multiple criteria, ‘and’ or ‘or’ combinations
You can also have multiple criteria in your filter call, both of which have to be met (x & y), or either one of which has to be met (x | y).

```r
example_multiple_criteria_1 <- data_amp_trimmed %>%
  filter(block_type != "test" & correct == 1)

example_multiple_criteria_2 <- data_amp_trimmed %>%
  filter(block_type != "test" | correct == 1)

# note that these provide different results - make sure you understand why
identical(example_multiple_criteria_1, example_multiple_criteria_2)
```
[1] FALSE
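If it is not obvious why the two results differ, a tiny invented data frame can help (toy_amp below is hypothetical, with one row for every combination of block type and correctness). With 'and', a row must pass both tests; with 'or', passing either one is enough.

```r
library(dplyr)

# hypothetical toy data: every combination of block_type and correct
toy_amp <- tibble(
  block_type = c("test", "test", "practice", "practice"),
  correct    = c(1, 0, 1, 0)
)

# AND: rows that are not test blocks and are also correct
rows_and <- toy_amp %>%
  filter(block_type != "test" & correct == 1)

# OR: rows that are not test blocks, or are correct, or both
rows_or <- toy_amp %>%
  filter(block_type != "test" | correct == 1)

nrow(rows_and)
nrow(rows_or)
```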
6.6.2 Practice filtering
Filter the self reports data frame to remove the instructions. Filter the AMP data frame to remove the practice blocks and the instruction trials.
```r
data_selfreport_trials <- data_selfreport_trimmed %>%
  #filter(item != "instructions")
  filter(item %in% c("positive", "prefer", "like"))

# this probably contains things we don't want
data_amp_trimmed %>%
  count(trial_type, block_type)
```
# A tibble: 5 × 3
trial_type block_type n
<chr> <chr> <int>
1 instructions test 2
2 prime_negative test 3604
3 prime_negative_practice practice 508
4 prime_positive test 3604
5 prime_positive_practice practice 506
```r
# we exclude them
data_amp_test_trials <- data_amp_trimmed %>%
  filter(block_type == "test") %>%
  filter(trial_type != "instructions")

# check they are excluded
data_amp_test_trials %>%
  count(trial_type, block_type)
```
# A tibble: 2 × 3
trial_type block_type n
<chr> <chr> <int>
1 prime_negative test 3604
2 prime_positive test 3604
# A tibble: 7,208 × 6
unique_id block_type trial_type trial_id rt_ms correct
<dbl> <chr> <chr> <chr> <dbl> <dbl>
1 504546409 test prime_positive 2_2 261 0
2 504546409 test prime_positive 2_3 428 0
3 504546409 test prime_positive 2_4 320 1
4 504546409 test prime_negative 2_5 408 1
5 504546409 test prime_negative 2_6 335 1
6 504546409 test prime_negative 2_7 324 1
7 504546409 test prime_negative 2_8 469 0
8 504546409 test prime_positive 2_9 1205 1
9 994692692 test prime_positive 2_2 1711 0
10 994692692 test prime_negative 2_3 727 0
# ℹ 7,198 more rows
6.7 Check your learning
What is the difference between select and filter?
Which is for rows and which is for columns?
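One way to check your answer is to compare the dimensions of the output of each function. This is a minimal sketch with an invented data frame (toy_data is hypothetical).

```r
library(dplyr)

# hypothetical toy data: 3 rows, 3 columns
toy_data <- tibble(
  unique_id = c(1, 2, 3),
  item      = c("age", "gender", "age"),
  response  = c("21", "male", "35")
)

# select() chooses columns: same number of rows, fewer columns
only_some_columns <- toy_data %>%
  select(unique_id, response)

# filter() chooses rows: same number of columns, fewer rows
only_some_rows <- toy_data %>%
  filter(item == "age")

dim(only_some_columns)
dim(only_some_rows)
```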
6.8 Mutating: creating new columns or changing the contents of existing ones
6.8.1 Understanding mutate()
mutate() is used to create new columns or to change the contents of existing ones.
```r
# mutating new variables
example_1 <- data_amp_test_trials %>%
  mutate(latency_plus_1 = rt_ms + 1)

example_2 <- data_amp_test_trials %>%
  mutate(log_latency = log(rt_ms))

# transforming an existing variable into a new one with different units
example_3 <- data_amp_test_trials %>%
  mutate(rt_s = rt_ms / 1000) # rt_s holds latency in seconds rather than milliseconds
```
The operations inside mutate can range from the very simple, like the above, to much more complex. The below example uses other functions we haven’t learned yet. For now, just notice that there can be multiple mutate calls and they can produce a cleaned up gender variable.
```r
# illustrate the problem with the gender responses:
data_demographics_trimmed %>%
  # filter only the gender item, not age
  filter(item == "gender") %>%
  count(response) %>%
  arrange(desc(n))
```
# A tibble: 11 × 2
response n
<chr> <int>
1 Male 36
2 female 27
3 male 18
4 Female 11
5 Non-Binary 2
6 23 1
7 FEMALE 1
8 MALE 1
9 Woman 1
10 non binary 1
11 yes 1
```r
# clean up the gender variable
data_demographics_gender_tidy_1 <- data_demographics_trimmed %>%
  # filter only the gender item, not age
  filter(item == "gender") %>%
  # change the name of the response variable to what it now represents: gender
  rename(gender = response) %>%
  # change or remove weird responses to the gender question
  mutate(gender = str_to_lower(gender)) %>%
  mutate(gender = str_remove_all(gender, "[\\d.]")) %>% # remove digits and periods
  mutate(gender = na_if(gender, "")) %>%
  mutate(gender = case_when(gender == "woman" ~ "female",
                            gender == "man" ~ "male",
                            gender == "girl" ~ "female",
                            gender == "yes" ~ NA_character_,
                            gender == "dude" ~ "male",
                            gender == "non binary" ~ "non-binary",
                            TRUE ~ gender)) %>%
  # select only the columns of interest
  select(unique_id, gender)

# illustrate the data after cleaning:
data_demographics_gender_tidy_1 %>%
  count(gender) %>%
  arrange(desc(n))
```
# A tibble: 4 × 2
gender n
<chr> <int>
1 male 55
2 female 40
3 non-binary 3
4 <NA> 2
A single mutate() call can contain multiple operations, which are evaluated in order. The code from the last chunk could be written more simply like this:

```r
# clean up the gender variable
data_demographics_gender_tidy_2 <- data_demographics_trimmed %>%
  # filter only the gender item, not age
  filter(item == "gender") %>%
  # change the name of the response variable to what it now represents: gender
  rename(gender = response) %>%
  # change or remove weird responses to the gender question
  mutate(gender = str_to_lower(gender),
         gender = str_remove_all(gender, "[\\d.]"), # remove digits and periods
         gender = na_if(gender, ""),
         gender = case_when(gender == "woman" ~ "female",
                            gender == "man" ~ "male",
                            gender == "girl" ~ "female",
                            gender == "yes" ~ NA_character_,
                            gender == "dude" ~ "male",
                            gender == "non binary" ~ "non-binary",
                            TRUE ~ gender)) %>%
  # select only the columns of interest
  select(unique_id, gender)

# check they are identical
identical(data_demographics_gender_tidy_1, data_demographics_gender_tidy_2)
```
[1] TRUE
6.8.2 Practice mutate()
When analyzing cognitive behavioral tasks, it is common to employ mastery criteria to exclude participants who have not met or maintained some criterion within the task. We'll do the actual exclusions later on, but for now practice using mutate() by creating a new fast_trial column to indicate trials where the response was implausibly fast (e.g., < 100 ms).
Try doing this with a simple logical test of whether rt_ms < 100. You can do this with or without using the ifelse() function.

```r
data_amp_test_trials_with_fast_trials <- data_amp_test_trials %>%
  mutate(fast_trial = ifelse(test = rt_ms < 100,
                             yes = TRUE,
                             no = FALSE))

# more briefly but less explicitly
data_amp_test_trials_with_fast_trials <- data_amp_test_trials %>%
  mutate(fast_trial = rt_ms < 100)
```
6.8.3 Practice mutate() & learn ifelse()
Use mutate() to remove weird values from data_demographics_trimmed$response, for the rows referring to age, that aren’t numbers.
What function could you use to first determine what values are present in this column, to know which could be retained or changed?
In simple cases like this, you can use mutate() and ifelse() to change impossible values to NA.
```r
# what values are present?
data_demographics_trimmed %>%
  filter(item == "age") %>%
  count(response)
```
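Once you know which values are problematic, the fix can be sketched as follows. This example runs on invented data (toy_age is hypothetical) and assumes that the only non-numeric response is the string "old"; check the output of count() on the real data before relying on that.

```r
library(dplyr)

# hypothetical toy data, standing in for the age rows of data_demographics_trimmed
toy_age <- tibble(
  unique_id = 1:4,
  item      = "age",
  response  = c("21", "old", "35", "19")
)

toy_age_tidy <- toy_age %>%
  # change the impossible value to NA (assumed here to be "old")
  mutate(response = ifelse(test = response == "old",
                           yes = NA_character_,
                           no = response)) %>%
  # the column can now be safely converted from character to numeric
  mutate(response = as.numeric(response)) %>%
  # rename the column to what it now represents
  rename(age = response)

# check this has fixed the issue
toy_age_tidy %>%
  count(age)
```

Converting to NA before calling as.numeric() avoids the coercion warning that as.numeric("old") would otherwise produce, and makes the decision to treat that value as missing explicit.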
6.8.4 Practice mutate() & ifelse()
Use mutate() to remove weird values from data_selfreport_trials$response that aren’t Likert responses.
First determine what values are present in this column.
Use ifelse() and %in% inside mutate() to change values other than the Likert responses to NA.
If you struggle to do this: practice writing ‘pseudocode’ here. That is, without knowing the right code, explain in precise logic what you want the computer to do. This can be converted to R more easily.
```r
# what values are present?
data_selfreport_trials %>%
  count(response)

# what type of data is the response column?
class(data_selfreport_trials$response)
```
[1] "character"
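```r
# remove non-Likert values using membership
# (assumes a 1 to 7 response scale; adjust the set of valid values if yours differs)
data_selfreport_tidy <- data_selfreport_trials %>%
  mutate(response = ifelse(response %in% as.character(1:7), response, NA_character_),
         response = as.numeric(response))

# show the data after changes
data_selfreport_tidy %>%
  count(response)
```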
What other ways are there of implementing this mutate, e.g., without using %in%? What are the pros and cons of each?
```r
# write examples here
```
6.8.5 Practice mutate() & learn case_when()
case_when() lets you chain multiple logical tests, like a series of if-else statements: each row receives the value belonging to the first test it passes.
The AMP data needs to be reverse scored. Just like an item on a self-report that is worded negatively (e.g., most items: I am a good person; some items: I am a bad person), the negative prime trials have the opposite ‘accuracy’ values to what they should have. Use mutate() and case_when() to reverse score the negative prime trials, so that what was 0 is now 1 and what was 1 is now 0.

```r
# in your own time later, see if you can rewrite this yourself without looking at the answer to practice using case_when
data_amp_tidy <- data_amp_test_trials_with_fast_trials %>%
  mutate(correct = case_when(trial_type == "prime_positive" ~ correct,
                             trial_type == "prime_negative" & correct == 0 ~ 1,
                             trial_type == "prime_negative" & correct == 1 ~ 0))

# you can also specify a default value to return if none of the logical tests are passed with 'TRUE ~':
data_amp_tidy <- data_amp_test_trials_with_fast_trials %>%
  mutate(correct = case_when(trial_type == "prime_negative" & correct == 0 ~ 1,
                             trial_type == "prime_negative" & correct == 1 ~ 0,
                             TRUE ~ correct))
```
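To convince yourself that the reverse scoring does what you intend, it can help to run it on a tiny invented data frame where you can check every row by eye. In the sketch below, toy_amp is hypothetical, and the arithmetic version (1 - correct) is an alternative added for comparison, not the approach used in this chapter.

```r
library(dplyr)

# hypothetical toy data: both trial types with both accuracy values
toy_amp <- tibble(
  trial_type = c("prime_positive", "prime_positive",
                 "prime_negative", "prime_negative"),
  correct    = c(0, 1, 0, 1)
)

toy_amp_scored <- toy_amp %>%
  mutate(
    # reverse score the negative prime trials with case_when()
    correct_case_when = case_when(
      trial_type == "prime_negative" & correct == 0 ~ 1,
      trial_type == "prime_negative" & correct == 1 ~ 0,
      TRUE ~ correct
    ),
    # alternative: for 0/1 data, subtracting from 1 flips the value
    correct_arithmetic = ifelse(trial_type == "prime_negative",
                                1 - correct,
                                correct)
  )

toy_amp_scored
```

The arithmetic version is shorter, but it silently produces nonsense if the column ever contains values other than 0 and 1, whereas the case_when() version states each case explicitly.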
6.9 Summarizing across rows
It is very common that we need to create summaries across rows, for example the mean and standard deviation of a column like age. This can be done with summarize(). Remember: mutate() creates new columns or modifies the contents of existing columns, but does not change the number of rows, whereas summarize() reduces a data frame down to one row (or to one row per group when the data are grouped).

```r
# mean and SD with rounding, illustrating how multiple summarizes can be done in one function call
data_demographics_age_tidy %>%
  summarize(mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE)) |>
  mutate(mean_age = round_half_up(mean_age, digits = 2),
         sd_age = round_half_up(sd_age, digits = 2))
```
Often, we don’t want to reduce a data frame down to a single row / summarize the whole dataset, but instead we want to create a summary for each (sub)group. For example:

```r
# # this code creates data needed for this example - you can simply load the data from disk and skip over this commented-out code. we will come back to things like 'joins' later
# data_demographics_unique_participant_codes <- data_demographics_trimmed %>%
#   count(unique_id) %>%
#   filter(n == 2)
#
# data_demographics_age_gender_tidy <- data_demographics_trimmed %>%
#   semi_join(data_demographics_unique_participant_codes, by = "unique_id") %>%
#   pivot_wider(names_from = "item",
#               values_from = "response") %>%
#   mutate(age = ifelse(age == "old", NA, age),
#          age = as.numeric(age),
#          gender = tolower(gender),
#          gender = stringr::str_remove_all(gender, regex("\\W+")), # regex is both very useful and awful to write
#          gender = case_when(gender == "female" ~ gender,
#                             gender == "male" ~ gender,
#                             gender == "nonbinary" ~ gender,
#                             gender == "woman" ~ "female",
#                             gender == "man" ~ "male"))
#
# dir.create("../data/processed")
# write_csv(data_demographics_age_gender_tidy, "../data/processed/data_demographics_age_gender_tidy.csv")

# load suitable example data from disk
data_demographics_age_gender_tidy <- read_csv("../data/processed/data_demographics_age_gender_tidy.csv")

# summarize() without grouping: one row for the whole dataset
data_demographics_age_gender_tidy %>%
  summarize(mean_age = mean(age, na.rm = TRUE))

# summarize n per gender group
data_demographics_age_gender_tidy %>%
  count(gender)
```
# A tibble: 4 × 2
gender n
<chr> <int>
1 female 40
2 male 53
3 nonbinary 3
4 <NA> 2
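Adding group_by() before summarize() produces one row per group rather than one row overall. The following is a minimal sketch on invented data; toy_age_gender is hypothetical and only mimics the structure of data_demographics_age_gender_tidy.

```r
library(dplyr)

# hypothetical toy data with one row per participant
toy_age_gender <- tibble(
  gender = c("female", "female", "male", "male", "male"),
  age    = c(20, 30, 25, 35, NA)
)

# one row per gender group instead of one row overall
toy_summary_by_gender <- toy_age_gender %>%
  group_by(gender) %>%
  summarize(n = n(),                          # number of rows in each group
            mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE))

toy_summary_by_gender
```

Note that n() counts rows, including those where age is missing, so n can be larger than the number of values that went into the mean.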
6.9.3 More complex summarizations
Like mutate, the operation you do to summarize can also be more complex, such as finding the mean result of a logical test to calculate a proportion. For example, the proportion of participants who are less than 25 years old:
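A sketch of that proportion calculation, using invented ages (toy_ages is hypothetical): the logical test age < 25 returns TRUE or FALSE for each row, and because R treats TRUE as 1 and FALSE as 0, the mean of the test is the proportion of rows that pass it.

```r
library(dplyr)

# hypothetical toy data, including one missing age
toy_ages <- tibble(age = c(19, 22, 24, 30, 41, NA))

toy_proportion <- toy_ages %>%
  # mean of a logical test = proportion of TRUE values;
  # na.rm = TRUE drops the participant with a missing age
  summarize(proportion_under_25 = mean(age < 25, na.rm = TRUE))

toy_proportion
```

Be aware that na.rm = TRUE changes the denominator: the proportion is out of participants with a known age, not out of all participants.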
You can also summarize (or indeed mutate) multiple columns in the same way using across(), for do-this-across-columns. We won’t cover how to use this here or all the variations that are possible, just know that it can be done. For example:
```r
# using the mtcars dataset that is built in to R (the {datasets} package), ...
mtcars %>%
  # ... calculate the mean of every numeric column in the dataset ...
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE))) %>%
  # ... and then round every column to one decimal place
  mutate(across(everything(), \(x) round_half_up(x, digits = 1)))
```
mpg cyl disp hp drat wt qsec vs am gear carb
1 20.1 6.2 230.7 146.7 3.6 3.2 17.8 0.4 0.4 3.7 2.8
6.9.4 Realise that count() is just a wrapper function for summarize()
```r
dat <- data.frame(x = c(rnorm(n = 50), rep(NA_integer_, 10)))

dat |>
  mutate(x_is_na = is.na(x)) |>
  count(x_is_na)
```
x_is_na n
1 FALSE 50
2 TRUE 10
```r
dat |>
  summarise(n_na = sum(is.na(x)))
```
n_na
1 10
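The equivalence can also be shown directly: count(x) gives the same frequencies as grouping by x and summarizing with n(). A minimal sketch on invented data (toy_gender is hypothetical):

```r
library(dplyr)

# hypothetical toy data
toy_gender <- tibble(gender = c("female", "male", "male", "female", "male"))

# frequencies using count()
via_count <- toy_gender %>%
  count(gender)

# the same frequencies using group_by() and summarize()
via_summarize <- toy_gender %>%
  group_by(gender) %>%
  summarize(n = n())

via_count
via_summarize
```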
6.9.5 Practice using summarize()
Calculate the min, max, mean, and SD of all responses on the self report data.
# A tibble: 1 × 4
mean sd min max
<dbl> <dbl> <dbl> <dbl>
1 1.72 1.26 1 7
Currently each participant has up to three responses on the self-report scales (three item scale: like, positive, and prefer). Create a new dataframe containing each unique_id’s mean score across the items. Also calculate how many items each participant has data for, and whether they have complete data (i.e., data for three items).
Using only participants with complete data, calculate the mean and SD of all participants’ mean scores on the self-reports.

```r
# data_selfreport_scored %>%
```
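If you get stuck, the general pattern for this exercise can be sketched on invented data. Everything here is hypothetical (toy_selfreport, its values, and the column names mean_score, n_items, and complete_data); try the exercise on the real data yourself before reading it.

```r
library(dplyr)

# hypothetical toy data: participant 2 has only one usable response
toy_selfreport <- tibble(
  unique_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  item      = c("like", "positive", "prefer",
                "like", "positive",
                "like", "positive", "prefer"),
  response  = c(1, 2, 3, 4, NA, 7, 6, 5)
)

# one row per participant: mean score, number of items with data, completeness
toy_selfreport_scored <- toy_selfreport %>%
  group_by(unique_id) %>%
  summarize(mean_score = mean(response, na.rm = TRUE),
            n_items = sum(!is.na(response))) %>%
  mutate(complete_data = n_items == 3)

# mean and SD of participants' mean scores, complete cases only
toy_selfreport_summary <- toy_selfreport_scored %>%
  filter(complete_data) %>%
  summarize(mean_of_means = mean(mean_score),
            sd_of_means = sd(mean_score))

toy_selfreport_scored
toy_selfreport_summary
```

Counting non-missing responses with sum(!is.na(response)) rather than n() matters here: a participant can have a row for an item without having usable data for it.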
Create a new data frame that calculates the proportion of prime-congruent trials for each participant on the AMP (i.e., the mean of the ‘correct’ column), their proportion of too-fast trials, and their number of trials.
Also add to that data frame a new column called “exclude_amp” and set it to “exclude” if more than 10% of a participant’s trials are too-fast trials and “include” if not.
```r
# data_amp_scored <- data_amp_tidy %>%
```
Calculate the proportion of participants who are to be excluded.
```r
# data_amp_scored %>%
```
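The AMP exercises follow the same group-then-summarize pattern. The sketch below uses invented data (toy_amp_tidy and the column names proportion_congruent, proportion_fast, and n_trials are hypothetical) and is one possible approach rather than the only correct answer.

```r
library(dplyr)

# hypothetical toy data: two participants with five trials each
toy_amp_tidy <- tibble(
  unique_id  = rep(c(1, 2), each = 5),
  correct    = c(1, 1, 0, 1, 0,
                 1, 0, 0, 1, 1),
  fast_trial = c(FALSE, FALSE, FALSE, FALSE, FALSE,
                 TRUE, FALSE, FALSE, FALSE, FALSE)
)

# one row per participant
toy_amp_scored <- toy_amp_tidy %>%
  group_by(unique_id) %>%
  summarize(proportion_congruent = mean(correct),
            proportion_fast = mean(fast_trial),
            n_trials = n()) %>%
  # flag participants with more than 10% too-fast trials
  mutate(exclude_amp = ifelse(proportion_fast > 0.10, "exclude", "include"))

# proportion of participants who are to be excluded
toy_amp_exclusions <- toy_amp_scored %>%
  summarize(proportion_excluded = mean(exclude_amp == "exclude"))

toy_amp_scored
toy_amp_exclusions
```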
6.10 Check your learning
What is the difference between mutate() and summarize()? If I use the wrong one, will I get the same answer? E.g., mutate(mean_age = mean(age, na.rm = TRUE)) vs. summarize(mean_age = mean(age, na.rm = TRUE))
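You can test your answer by running both versions on a small invented data frame and comparing the shape of the results (toy_ages is hypothetical):

```r
library(dplyr)

# hypothetical toy data
toy_ages <- tibble(
  unique_id = 1:4,
  age       = c(20, 30, 40, NA)
)

# mutate(): keeps every row and adds the mean as a repeated column
with_mutate <- toy_ages %>%
  mutate(mean_age = mean(age, na.rm = TRUE))

# summarize(): collapses the data frame to a single row
with_summarize <- toy_ages %>%
  summarize(mean_age = mean(age, na.rm = TRUE))

dim(with_mutate)
dim(with_summarize)
```

The value of mean_age is the same in both, but the data frames are not interchangeable: one still has a row per participant, the other has discarded the participant-level data.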
---title: "Data transformation"format: html: toc: true toc_float: true code-fold: show code-tools: true self-contained: true---```{r}#| include: false# settings, placed in a chunk that will not show in the .html file (because include=FALSE) # disables scientific notation so that small numbers appear as eg "0.00001" rather than "1e-05"options(scipen =999) ```## Dependencies and data**\[\[This data comes from a real study on implicit and self-reported evaluations. The implementation of the procedure produced three data files: one for the demographics data, one for the self-reported evaluations, and one for the implicit measure (the 'Affect Misattribution Procedure'). This script uses each of these to learn and practice functions from the readr, dplyr, and tidyr libraries that are commonly used for data wrangling. In doing so, we will learn how to do many of the steps involved in data processing for a given experiment.\]\]**```{r}library(dplyr)library(tidyr)library(readr)library(janitor) # for clean_names() and round_half_up()library(roundwork) # for round_up()library(stringr)library(knitr) # for kable()library(kableExtra) # for kable_classic()# demographics datadata_demographics_raw <-read_csv(file ="../data/raw/data_demographics_raw.csv") # self report measure datadata_selfreport_raw <-read_csv(file ="../data/raw/data_selfreport_raw.csv") # affect attribution procedure datadata_amp_raw <-read_csv(file ="../data/raw/data_amp_raw.csv")# clean column namesdata_demographics_clean_names <- data_demographics_raw %>%clean_names() data_selfreport_clean_names <- data_selfreport_raw %>%clean_names() data_amp_clean_names <- data_amp_raw %>%clean_names() ```## Renaming columnsOften variable names are not intuitive. 
An early step in any data wrangling is to make them more intuitive.Rename the self reports and AMP data too.```{r}data_demographics_renamed <- data_demographics_clean_names %>%rename(unique_id = subject,item = trialcode,rt_ms = latency) data_selfreport_renamed <- data_selfreport_clean_names %>%rename(unique_id = subject,item = trialcode,rt_ms = latency) data_amp_renamed <- data_amp_clean_names %>%rename(unique_id = subject,block_type = blockcode,trial_type = trialcode,trial_id = blocknum_and_trialnum,rt_ms = latency) ```## Selecting columnsNot all variables are useful to you. An early step in any data wrangling is to drop the columns that you don't need.Select the self reports and AMP data too.```{r}data_demographics_selected_columns <- data_demographics_renamed %>%select(unique_id, item, response)data_selfreport_selected_columns <- data_selfreport_renamed %>%select(unique_id, item, response, rt_ms)data_amp_selected_columns <- data_amp_renamed %>%select(unique_id, # methods variables block_type, trial_type, trial_id,# responses rt_ms, correct)```### More flexible selecting```{r}dat <-data.frame(var_1_1 =rnorm(n =100),var_1_2 =rnorm(n =100),var_1_3 =rnorm(n =100),var_1_4 =rnorm(n =100),var_1_5 =rnorm(n =100),var_2_1 =rnorm(n =100),var_2_2 =rnorm(n =100),var_2_3 =rnorm(n =100),var_2_4 =rnorm(n =100),var_2_5 =rnorm(n =100))dat |>select(starts_with("var_1")) dat |>select(ends_with("var_1")) dat |>select(contains("_1_")) ```## Practice the pipe againCombine the above function calls using pipes. Notice how this involves fewer objects in your environment, and therefore less potential for confusion or error.Remember: this is how we solve coding problems: break them down into smaller tasks and problems, get each of them working individually, then combine them together again. 
When you only see the end product, it's easy to think the author simply wrote the code as you see it, when they often wrote much more verbose chunks of code and then combined them together.Rewrite the rename and select calls for the AMP and self report data too.```{r}# remove all objects in environmentrm(list =ls())data_demographics_trimmed <-# read in the dataread_csv("../data/raw/data_demographics_raw.csv") %>%# convert to snake caseclean_names() %>%# make names more intuitiverename(unique_id = subject,item = trialcode) %>%# retain only columns of interestselect(unique_id, item, response)data_selfreport_trimmed <-read_csv("../data/raw/data_selfreport_raw.csv") %>%clean_names() %>%rename(unique_id = subject,item = trialcode) %>%select(unique_id, item, response)data_amp_trimmed <-read_csv("../data/raw/data_amp_raw.csv") %>%clean_names() %>%rename(unique_id = subject,block_type = blockcode,trial_type = trialcode,trial_id = blocknum_and_trialnum,rt_ms = latency) %>%select(unique_id, # methods variables block_type, trial_type, trial_id,# responses rt_ms, correct)```## Counting frequenciesAfter renaming and selecting columns, we know what columns we have. But what rows do we have in each of these? What might we need to exclude, change, work with in some way later on? 
It is very useful to use `count()` to obtain the frequency of each unique value of a given column```{r}data_demographics_trimmed %>%count(item)data_demographics_trimmed %>%count(response)``````{r}data_selfreport_trimmed %>%count(item)data_selfreport_trimmed %>%count(response)``````{r}data_amp_trimmed %>%count(trial_type)data_amp_trimmed %>%count(block_type)data_amp_trimmed %>%count(correct)data_amp_trimmed %>%count(rt_ms)```### Frequncies of sets of columnsNote that it is also possible to use count to obtain the frequencies of sets of unique values across columns, e.g., unique combinations of item and response.```{r}data_demographics_trimmed %>%count(item)data_demographics_trimmed %>%count(response)data_demographics_trimmed %>%count(item, response)```It can be useful to arrange the output by the frequencies.```{r}data_demographics_trimmed %>%count(item, response) %>%arrange(desc(n)) # arrange in descending order```## Filtering rowsOnce we know the contents of our columns, we may wish to exclude some rows using `filter()`.You can specify the logical test for filtering in many ways, including equivalence (`==`), negation (`!=`), or membership (`%in%`). It is often better to define what you *do* want (using equivalence or membership) rather than what you *do not* want (negation), as negations are less robust to new data with weird values you didn't think of when you wrote the code. E.g., you could specify `gender != "non-binary"` but this would not catch `non binary`. 
If you were for example looking to include only men and women, instead use `gender %in% c("man", "woman")`.\*\*\[This is just an example; there is usually no good a priori reason to exclude gender diverse participants\]```{r}# example using equivalenceexample_equivalence <- data_amp_trimmed %>%filter(block_type =="test")# example using negationexample_negation <- data_selfreport_trimmed %>%filter(item !="instructions")# example using membershipexample_membership <- data_selfreport_trimmed %>%filter(item %in%c("positive", "prefer", "like"))```### Multiple criteria, 'and' or 'or' combinationsYou can also have multiple criteria in your filter call, both of which have to be met (x `&` y), or either one of which have to be met (x `|` y).```{r}example_multiple_criteria_1 <- data_amp_trimmed %>%filter(block_type !="test"& correct ==1)example_multiple_criteria_2 <- data_amp_trimmed %>%filter(block_type !="test"| correct ==1)# note that these provide different results - make sure you understand whyidentical(example_multiple_criteria_1, example_multiple_criteria_2)```### Practice filteringFilter the self reports data frame to remove the instructions. 
Filter the AMP data frame to remove the practice blocks and the instruction trials.```{r}data_selfreport_trials <- data_selfreport_trimmed %>%#filter(item != "instructions")filter(item %in%c("positive", "prefer", "like"))# this probably contains things we don't wantdata_amp_trimmed %>%count(trial_type, block_type)# we exclude themdata_amp_test_trials <- data_amp_trimmed %>%filter(block_type =="test") %>%filter(trial_type !="instructions")# check they are excludeddata_amp_test_trials %>%count(trial_type, block_type)```### More flexible filteringReturn rows with exactly this contents```{r}data_amp_test_trials |>filter(trial_id =="A") # ```Return rows containing contents but not exactly it```{r}library(stringr)test <-c("A", "AB", "B")test =="A"str_detect(test, "A")str_detect(test, "B")data_amp_test_trials |>filter(str_detect(trial_id, "2_")) ```#### Multiple logical tests```{r}# "|" = OR# "&" = ANDdata_amp_test_trials |>filter(str_detect(trial_id, "2_") &str_detect(trial_id, "3_"))data_amp_test_trials |>mutate(rt_ms =ifelse(str_detect(trial_id, "2_"), rt_ms+100, rt_ms))```## Check your learningWhat is the difference between select and filter?Which is for rows and which is for columns?## Mutating: creating new columns or changing the contents of existing ones### Understanding `mutate()``mutate()` is used to create new columns or to change the contents of existing ones.```{r}# mutating new variablesexample_1 <- data_amp_test_trials %>%mutate(latency_plus_1 = rt_ms +1)example_2 <- data_amp_test_trials %>%mutate(log_latency =log(rt_ms))# mutating the contents of existing variablesexample_3 <- data_amp_test_trials %>%mutate(rt_s = rt_ms /1000) # latency is now in seconds rather than milliseconds```The operations inside mutate can range from the very simple, like the above, to much more complex. The below example uses other functions we haven't learned yet. 
For now, just notice that there can be multiple mutate calls and they can produce a cleaned up gender variable.```{r}# illustrate the problem with the gender responses:data_demographics_trimmed %>%# filter only the gender item, not agefilter(item =="gender") %>%count(response) %>%arrange(desc(n))# clean up the gender variabledata_demographics_gender_tidy_1 <- data_demographics_trimmed %>%# filter only the gender item, not agefilter(item =="gender") %>%# change the name of the response variable to what it now represents: genderrename(gender = response) %>%# change or remove weird responses to the gender questionmutate(gender =str_to_lower(gender)) %>%mutate(gender =str_remove_all(gender, "[\\d.]")) %>%# remove everything except lettersmutate(gender =na_if(gender, "")) %>%mutate(gender =case_when(gender =="woman"~"female", gender =="man"~"male", gender =="girl"~"female", gender =="yes"~NA_character_, gender =="dude"~"male", gender =="non binary"~"non-binary",TRUE~ gender)) %>%# select only the columns of interestselect(unique_id, gender)# illustrate the data after cleaning:data_demographics_gender_tidy_1 %>%count(gender) %>%arrange(desc(n))```A single mutate call can contain multiple mutates. 
The code from the last chunk could be written more simply like this:```{r}# clean up the gender variabledata_demographics_gender_tidy_2 <- data_demographics_trimmed %>%# filter only the gender item, not agefilter(item =="gender") %>%# change the name of the response variable to what it now represents: genderrename(gender = response) %>%# change or remove weird responses to the gender questionmutate(gender =str_to_lower(gender),gender =str_remove_all(gender, "[\\d.]"), # remove everything except lettersgender =na_if(gender, ""), gender =case_when(gender =="woman"~"female", gender =="man"~"male", gender =="girl"~"female", gender =="yes"~NA_character_, gender =="dude"~"male", gender =="non binary"~"non-binary",TRUE~ gender)) %>%# select only the columns of interestselect(unique_id, gender)# check they are identicalidentical(data_demographics_gender_tidy_1, data_demographics_gender_tidy_2)```### Practice `mutate()`When analyzing cognitive behavioral tasks, it is common to employ mastery criteria to exclude participants who have not met or maintained some criterion within the task. We'll do the actual exclusions etc. later on, but for practice using `mutate()` by creating a new `fast_trial` column to indicate trials where the response was implausibly fast (e.g., \< 100 ms).Try doing this with a simple logical test of whether latency \< 100. 
You can do this with or without using the `ifelse()` function.```{r}data_amp_test_trials_with_fast_trials <- data_amp_test_trials %>%mutate(fast_trial =ifelse(test = rt_ms <100,yes =TRUE,no =FALSE))# more briefly but less explicitlydata_amp_test_trials_with_fast_trials <- data_amp_test_trials %>%mutate(fast_trial = rt_ms <100)```### Practice `mutate()` & learn `ifelse()`Use `mutate()` to remove weird values from `data_demographics_trimmed$response`, for the rows referring to age, that aren't numbers.What function could you use to first determine what values are present in this column, to know which could be retained or changed?In simple cases like this, you can use `mutate()` and `ifelse()` to change impossible values to `NA`.```{r}# what values are present?data_demographics_trimmed %>%filter(item =="age") %>%count(response) # fix them with mutatedata_demographics_age_tidy <- data_demographics_trimmed %>%filter(item =="age") %>%mutate(response =ifelse(test = response =="old",yes =NA_integer_,no = response)) %>%mutate(response =as.numeric(response)) %>%rename(age = response)# check this has fixed the issuedata_demographics_age_tidy %>%count(age)```### Practice `mutate()` & `ifelse()`Use `mutate()` to remove weird values from `data_selfreport_trials$response` that aren't Likert responses.First determine what values are present in this column.Use `ifelse()` and `%in%` inside `mutate()` to change values other than the Likert responses to `NA`.**If you struggle to do this: practice writing 'pseudocode' here. That is, without knowing the right code, explain in precise logic what you want the computer to do. 
This can be converted to R more easily.**

```{r}
# what values are present?
data_selfreport_trials %>%
  count(response)

# what type of data is the response column?
class(data_selfreport_trials$response)

# remove non Likert values
data_selfreport_tidy <- data_selfreport_trials %>%
  mutate(response = ifelse(response == "Ctrl+'B'", NA_character_, response),
         response = as.numeric(response))

# show the data after changes
data_selfreport_tidy %>%
  count(response)

class(data_selfreport_tidy$response)
```

What other ways are there of implementing this mutate, e.g., without using `%in%`? What are the pros and cons of each?

```{r}
# write examples here
```

### Practice `mutate()` & learn `case_when()`

`case_when()` allows you to chain multiple logical tests, like a sequence of if-else statements.

The AMP data needs to be reverse scored. Just like an item on a self-report that is worded negatively (e.g., most items: I am a good person; some items: I am a bad person), the negative prime trials have the opposite 'accuracy' values to what they should have. Use `mutate()` and `case_when()` to reverse score the negative prime trials, so that what was 0 is now 1 and what was 1 is now 0.

```{r}
# in your own time later, see if you can rewrite this yourself without looking at the answer to practice using case_when
data_amp_tidy <- data_amp_test_trials_with_fast_trials %>%
  mutate(correct = case_when(trial_type == "prime_positive" ~ correct,
                             trial_type == "prime_negative" & correct == 0 ~ 1,
                             trial_type == "prime_negative" & correct == 1 ~ 0))

# you can also specify a default value to return if none of the logical tests are passed with 'TRUE ~':
data_amp_tidy <- data_amp_test_trials_with_fast_trials %>%
  mutate(correct = case_when(trial_type == "prime_negative" & correct == 0 ~ 1,
                             trial_type == "prime_negative" & correct == 1 ~ 0,
                             TRUE ~ correct))
```

## Summarizing across rows

It is very common that we need to create summaries across rows. For example, to create the mean and standard deviation of a column like age. This can be done with `summarize()`.
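
As a minimal, self-contained sketch of what `summarize()` does, the chunk below uses a small made-up data frame. The object `toy_ages` and its values are invented purely for illustration and are not part of the study data.

```{r}
library(dplyr)

# hypothetical toy data, invented for illustration only
toy_ages <- data.frame(unique_id = 1:5,
                       age = c(20, 30, NA, 40, 50))

# summarize() collapses the five rows into a single row of summary statistics;
# na.rm = TRUE drops the missing age before the mean and SD are calculated
toy_summary <- toy_ages %>%
  summarize(mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE))

toy_summary

# the input had five rows; check how many the summary has
nrow(toy_summary)
```

Comparing `nrow(toy_ages)` with `nrow(toy_summary)` is a quick way to confirm that rows were collapsed.
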
Remember: `mutate()` creates new columns or modifies the contents of existing columns, but does not change the number of rows. In contrast, `summarize()` reduces a data frame down to one row.

```{r}
# mean
data_demographics_age_tidy %>%
  summarize(mean_age = mean(age, na.rm = TRUE))

# SD
data_demographics_age_tidy %>%
  summarize(sd_age = sd(age, na.rm = TRUE))

# mean and SD with rounding, illustrating how multiple summarizes can be done in one function call
data_demographics_age_tidy %>%
  summarize(mean_age = mean(age, na.rm = TRUE),
            sd_age = sd(age, na.rm = TRUE)) |>
  mutate(mean_age = round_half_up(mean_age, digits = 2),
         sd_age = round_half_up(sd_age, digits = 2))
```

### `group_by()`

Often, we don't want to reduce a data frame down to a single row that summarizes the whole dataset, but instead want to create a summary for each (sub)group. For example:

```{r}
# # this code creates data needed for this example - you can simply load the data from disk and skip over this commented-out code. we will come back to things like 'joins' later
# data_demographics_unique_participant_codes <- data_demographics_trimmed %>%
#   count(unique_id) %>%
#   filter(n == 2)
#
# data_demographics_age_gender_tidy <- data_demographics_trimmed %>%
#   semi_join(data_demographics_unique_participant_codes, by = "unique_id") %>%
#   pivot_wider(names_from = "item",
#               values_from = "response") %>%
#   mutate(age = ifelse(age == "old", NA, age),
#          age = as.numeric(age),
#          gender = tolower(gender),
#          gender = stringr::str_remove_all(gender, regex("\\W+")), # regex is both very useful and awful to write
#          gender = case_when(gender == "female" ~ gender,
#                             gender == "male" ~ gender,
#                             gender == "nonbinary" ~ gender,
#                             gender == "woman" ~ "female",
#                             gender == "man" ~ "male"))
#
# dir.create("../data/processed")
# write_csv(data_demographics_age_gender_tidy, "../data/processed/data_demographics_age_gender_tidy.csv")

# load suitable example data from disk
data_demographics_age_gender_tidy <- read_csv("../data/processed/data_demographics_age_gender_tidy.csv")

# illustrate use of group_by() and summarize()
data_demographics_age_gender_tidy %>%
  summarize(mean_age = mean(age, na.rm = TRUE))

data_demographics_age_gender_tidy %>%
  group_by(gender) %>%
  summarize(mean_age = mean(age, na.rm = TRUE))
```

### `n()`

`n()` calculates the number of rows, i.e., the N. It can be useful in `summarize()`.

```{r}
# summarize n
data_demographics_age_gender_tidy %>%
  summarize(n_age = n())

# summarize n per gender group
data_demographics_age_gender_tidy %>%
  group_by(gender) %>%
  summarize(n_age = n())
```

Note that `count()` is just the combination of `group_by()`, `summarize()`, and `n()`! The calls below produce the same results as above.

```{r}
# summarize n
data_demographics_age_gender_tidy %>%
  count()

# summarize n per gender group
data_demographics_age_gender_tidy %>%
  count(gender)
```

### More complex summarizations

Like `mutate()`, the operation you do to summarize can also be more complex, such as finding the mean result of a logical test to calculate a proportion. For example, the proportion of participants who are less than 25 years old:

```{r}
data_demographics_age_tidy %>%
  summarize(proportion_less_than_25 = mean(age < 25, na.rm = TRUE)) %>%
  mutate(percent_less_than_25 = round_half_up(proportion_less_than_25 * 100, 1))
```

You can also summarize (or indeed mutate) multiple columns in the same way using `across()`, for do-this-across-columns. We won't cover how to use this here or all the variations that are possible, just know that it can be done. For example:

```{r}
# using the mtcars dataset that is built in to R (the {datasets} package), ...
mtcars %>%
  # ... calculate the mean of every numeric column in the dataset ...
  # (an anonymous function is used because passing extra arguments through across() is deprecated in recent dplyr versions)
  summarise(across(where(is.numeric), \(x) mean(x, na.rm = TRUE))) %>%
  # ... and then round every column to one decimal place
  mutate(across(everything(), \(x) round_half_up(x, digits = 1)))
```

### Realize that `count()` is just a wrapper function for `summarize()`

```{r}
dat <- data.frame(x = c(rnorm(n = 50), rep(NA_real_, 10)))

dat |>
  mutate(x_is_na = is.na(x)) |>
  count(x_is_na)

dat |>
  summarise(n_na = sum(is.na(x)))
```

### Practice using `summarize()`

Calculate the min, max, mean, and SD of all responses on the self report data.

```{r}
data_selfreport_tidy %>%
  summarize(mean = mean(response, na.rm = TRUE),
            sd = sd(response, na.rm = TRUE),
            min = min(response, na.rm = TRUE),
            max = max(response, na.rm = TRUE))
```

Currently each participant has up to three responses on the self-report scales (three item scale: like, positive, and prefer). Create a new data frame containing each `unique_id`'s mean score across the items. Also calculate how many items each participant has data for, and whether they have complete data (i.e., data for three items).

```{r}
data_selfreport_scored <- data_selfreport_tidy %>%
  group_by(unique_id) %>%
  # note: mean() without na.rm returns NA for any participant with a missing response
  summarize(mean_self_report = mean(response),
            # count only non-missing responses; n() would also count rows where response is NA
            n_self_report_items = sum(!is.na(response))) %>%
  mutate(self_report_complete = n_self_report_items == 3)

# test <- c(3, 5, 7, NA)
# #test <- c(3, 5, 7)
# mean(test)
# mean(test, na.rm = TRUE)
#
# dat |>
#   summarize(mean = mean(response, na.rm = TRUE))
#
# dat |>
#   filter(!is.na(response)) |>
#   summarize(mean = mean(response))
#
# mean_not_dumb <- function(x){mean(x, na.rm = TRUE)}
```

Using only participants with complete data, calculate the mean and SD of all participants' mean scores on the self-reports.

```{r}
# data_selfreport_scored %>%
```

Create a new data frame that calculates the proportion of prime-congruent trials for each participant on the AMP (i.e., the mean of the 'correct' column), their proportion of too-fast trials, and their number of trials.

Also add to that data frame a new column called "exclude_amp" and set it to "exclude" if more than 10% of a participant's trials are too-fast trials and "include" if not.

```{r}
# data_amp_scored <- data_amp_tidy %>%
```

Calculate the proportion of participants who are to be excluded.

```{r}
# data_amp_scored %>%
```

## Check your learning

What is the difference between `mutate()` and `summarize()`? If I use the wrong one, will I get the same answer? E.g., `mutate(mean_age = mean(age, na.rm = TRUE))` vs. `summarize(mean_age = mean(age, na.rm = TRUE))`.

## Writing data to disk

```{r}
# write_csv(data_processed, "../data/processed/data_processed.csv")
```
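
Because the final `data_processed` object is not created in this section, the chunk below is only a hedged sketch of the write-then-check pattern. It uses a made-up data frame (`toy_processed`) and a temporary file (`toy_path`), both of which are assumptions for illustration rather than part of the study's data or folder structure.

```{r}
library(readr)

# hypothetical stand-in for a processed data frame (values invented for illustration)
toy_processed <- data.frame(unique_id = c(1, 2, 3),
                            mean_self_report = c(2.5, 4, NA))

# write to a temporary file rather than the project's data directory
toy_path <- tempfile(fileext = ".csv")
write_csv(toy_processed, toy_path)

# read the file back in to check that the rows and columns survived the round trip
toy_reloaded <- read_csv(toy_path, show_col_types = FALSE)

nrow(toy_reloaded)
names(toy_reloaded)
```

Reading a file back in immediately after writing it is a cheap sanity check that missing values and column names were stored as intended.
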